Computational Techniques For Improved Name Search

Authors

  • Beatrice T. Oshika
  • Filip Machi
  • Bruce Evans
  • Janet Tom
Abstract

This paper describes enhancements made to techniques currently used to search large databases of proper names. Improvements included use of a Hidden Markov Model (HMM) statistical classifier to identify the likely linguistic provenance of a surname, and application of language-specific rules to generate plausible spelling variations of names. These two components were incorporated into a prototype front-end system driving existing name search procedures. HMM models and sets of linguistic rules were constructed for Farsi, Spanish and Vietnamese surnames and tested on a database of over 11,000 entries. Preliminary evaluation indicates a 20-30% improvement in retrieval, as measured by the number of correct items retrieved.

1.0 INTRODUCTION

This paper describes enhancements made to current name search techniques used to access large databases of proper names. The work focused on improving name search algorithms to yield better matching and retrieval performance on databases containing large numbers of non-European 'foreign' names. Because the linguistic mix of names in large computer-supported databases has changed due to recent immigration and other demographic factors, current name search procedures do not provide the accurate retrieval required by insurance companies, state motor vehicle bureaus, law enforcement agencies and other institutions. Because the potential consequences of incorrect retrieval are so severe (e.g., loss of benefits, false arrest), name search techniques must be improved to handle the linguistic variability reflected in current databases.
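The idea of scoring a surname against one statistical model per language and picking the best match can be illustrated with a short sketch. This is not the paper's HMM classifier: it swaps in a much simpler character-bigram Markov chain with add-one smoothing, and the training names, the class labels as dictionary keys, and the function names (`train_bigram`, `log_likelihood`, `classify`) are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict

# Hypothetical toy training data, NOT taken from the paper's database.
TRAINING = {
    "vietnamese": ["nguyen", "tran", "pham", "huynh", "hoang", "phan"],
    "farsi": ["hosseini", "ahmadi", "rezaei", "karimi", "moradi", "jafari"],
    "spanish": ["garcia", "rodriguez", "martinez", "hernandez", "lopez", "gonzalez"],
    "other": ["smith", "johnson", "miller", "schmidt", "dubois", "rossi"],
}

# Possible "next" symbols: 26 letters plus the end-of-name marker "$".
NEXT_SYMBOLS = 27


def train_bigram(names):
    """Count character bigrams, with '^' and '$' marking name boundaries."""
    counts = defaultdict(lambda: defaultdict(int))
    for name in names:
        padded = "^" + name.lower() + "$"
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1
    # Freeze into plain dicts so later lookups never create new entries.
    return {a: dict(row) for a, row in counts.items()}


def log_likelihood(counts, name):
    """Log-probability of a name under one bigram model (add-one smoothing)."""
    padded = "^" + name.lower() + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        row = counts.get(a, {})
        total += math.log((row.get(b, 0) + 1) / (sum(row.values()) + NEXT_SYMBOLS))
    return total


def classify(name, models):
    """Return the label of the model that scores the name highest.

    Every name is assigned to exactly one of the known classes; there is
    no 'unknown' outcome in this sketch.
    """
    return max(models, key=lambda lang: log_likelihood(models[lang], name))


MODELS = {lang: train_bigram(names) for lang, names in TRAINING.items()}

if __name__ == "__main__":
    for query in ["nguyen", "garcia"]:
        print(query, "->", classify(query, MODELS))
```

A real system would train on thousands of names per class, and an HMM adds hidden states on top of this kind of sequence scoring. The decision step stays the same: compute one score per language model and take the maximum.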
Our specific approach decomposed the name search problem into two main components:

  • Language classification techniques to identify the source language for a given query name, and
  • Name association techniques that, once the source language of a name is known, exploit language-specific rules to generate variants of the name arising from spelling variation, bad transcriptions, nicknames, and other naming conventions.

A statistical classification technique based on the use of Hidden Markov Models (HMM) was used as a language discriminator. The test database contained about 11,000 names, including about 2,000 each from three target languages, Vietnamese, Farsi and Spanish, and 5,000 termed 'other' to broadly represent general European names. The decision procedures assumed a closed-world situation in which a name must be assigned to one of the four classes.

Language-specific rules in the form of context-sensitive string rewrite rules were used to generate name variants. These were based on linguistic analysis of naming conventions, pronunciations and common misspellings for each target language.

These two components were incorporated into a front-end system driving existing name search procedures. The front-end system was implemented in the C language and runs on a VAX-11/780 and Sun 3 workstations under Unix 4.2. Preliminary tests
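To make the notion of context-sensitive string rewrite rules concrete, here is a minimal sketch of how such rules could generate spelling variants. The `Rule` type, the `apply_rule` and `generate_variants` helpers, and the four example rules are hypothetical illustrations; they are not the rule sets the authors built, and the original system was written in C rather than Python.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    """Rewrite `target` as `replacement` when the context patterns match."""
    target: str
    replacement: str
    left: str = ""   # fixed-width regex required immediately before target
    right: str = ""  # regex required immediately after target


# Hypothetical Spanish-flavoured rules, for illustration only.
RULES = [
    Rule("z", "s"),                   # z may be written s anywhere
    Rule("s", "z", right="$"),        # final s may be written z
    Rule("v", "b", right="[aeiou]"),  # v before a vowel may be written b
    Rule("ll", "y", left="[aeiou]"),  # ll after a vowel may be written y
]


def apply_rule(name, rule):
    """Return every string made by applying the rule at one matching position."""
    pattern = re.compile(
        (f"(?<={rule.left})" if rule.left else "")
        + re.escape(rule.target)
        + (f"(?={rule.right})" if rule.right else "")
    )
    return {
        name[: m.start()] + rule.replacement + name[m.end():]
        for m in pattern.finditer(name)
    }


def generate_variants(name, rules, max_steps=2):
    """Collect the name plus all forms reachable in at most max_steps rewrites."""
    seen = {name}
    frontier = {name}
    for _ in range(max_steps):
        produced = set()
        for form in frontier:
            for rule in rules:
                produced |= apply_rule(form, rule)
        frontier = produced - seen  # only expand forms not seen before
        seen |= frontier
    return seen


if __name__ == "__main__":
    print(sorted(generate_variants("gonzalez", RULES)))
```

Bounding the number of rewrite steps keeps the variant set small. Each generated variant could then be passed as a separate query to an existing name search procedure.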


Similar resources

Improved Content Aware Image Retargeting Using Strip Partitioning

Based on rapid upsurge in the demand and usage of electronic media devices such as tablets, smart phones, laptops, personal computers, etc. and its different display specifications including the size and shapes, image retargeting became one of the key components of communication technology and internet. The existing techniques in image resizing cannot save the most valuable information of image...


EFFICIENCY OF IMPROVED HARMONY SEARCH ALGORITHM FOR SOLVING ENGINEERING OPTIMIZATION PROBLEMS

Many optimization techniques have been proposed since the inception of engineering optimization in 1960s. Traditional mathematical modeling-based approaches are incompetent to solve the engineering optimization problems, as these problems have complex system that involves large number of design variables as well as equality or inequality constraints. In order to overcome the various difficultie...


A multi Agent System Based on Modified Shifting Bottleneck and Search Techniques for Job Shop Scheduling Problems

This paper presents a multi agent system for the job shop scheduling problems. The proposed system consists of initial scheduling agent, search agents, and schedule management agent. In initial scheduling agent, a modified Shifting Bottleneck is proposed. That is, an effective heuristic approach and can generate a good solution in a low computational effort. In search agents, a hybrid search ap...


An Improved Modified Tabu Search Algorithm to Solve the Vehicle Routing Problem with Simultaneous Pickup and Delivery

The vehicle routing problem with simultaneous pickup and delivery (VRPSPD) is a well-known combinatorial optimization problem which addresses provided service to a set of customers using a homogeneous fleet of capacitated vehicles. The objective is to minimize the distance traveled. The VRPSPD is an NP-hard combinatorial optimization problem. Therefore, practical large-scale instances of VR...




Publication date: 1988